The purpose of this notebook is twofold. First, it demonstrates the basic functionality of PyLogit for estimating Mixed Logit models. Second, it compares the estimation results for a Mixed Logit model from PyLogit and mlogit on a common dataset and model specification. The dataset is the "Electricity" dataset. Both the dataset and the model specification come from Kenneth Train's exercises. See the mlogit PDF.
In [1]:
from collections import OrderedDict # For recording the model specification
import pandas as pd # For file input/output
import numpy as np # For vectorized math operations
import pylogit as pl # For choice model estimation
In [2]:
# Load the raw electricity data
long_electricity = pd.read_csv("../data/electricity_r_data_long.csv")
long_electricity.head().T
Out[2]:
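For orientation, a quick sanity check of the columns this notebook relies on is sketched below. The column names are taken from their later use in this notebook; the check itself is illustrative and not part of the original workflow.
# Columns used later in this notebook: "alt" identifies the
# alternative, "chid" the choice situation, and "id" the individual.
needed_columns = ["choice", "alt", "chid", "id",
                  "pf", "cl", "loc", "wk", "tod", "seas"]
assert all(col in long_electricity.columns for col in needed_columns)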
In [3]:
# Make sure that the choice column contains ones and zeros as opposed
# to true and false
long_electricity["choice"] = long_electricity["choice"].astype(int)
# List the variables that are the index variables
index_var_names = ["pf", "cl", "loc", "wk", "tod", "seas"]
# Transform all of the index variable columns to have float dtypes
for col in index_var_names:
    long_electricity[col] = long_electricity[col].astype(float)
In [4]:
# Create the model's specification dictionary and variable names dictionary
# NOTE: - Keys should be variables within the long format dataframe.
# The sole exception to this is the "intercept" key.
# - For the specification dictionary, the values should be lists
# or lists of lists. Within a list, or within the inner-most
# list, are the alternative IDs of the alternatives whose
# utility specifications the explanatory variable enters.
example_specification = OrderedDict()
example_names = OrderedDict()
# Note that the names used below are simply for consistency with
# the coefficient names given in the mlogit vignette.
for col in index_var_names:
    example_specification[col] = [[1, 2, 3, 4]]
    example_names[col] = [col]
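Since every variable above enters all four alternatives with a single generic coefficient, each specification value is a single list. For reference, a hypothetical lists-of-lists specification, which gives a variable separate coefficients for different groups of alternatives, would look like the following sketch (the variable and coefficient names here are illustrative only, not part of the estimated model):
# Hypothetical sketch of an alternative-specific specification.
# Here "pf" would receive one coefficient shared by alternatives
# 1 and 2, and a second coefficient shared by alternatives 3 and 4.
alt_specific_specification = OrderedDict()
alt_specific_specification["pf"] = [[1, 2], [3, 4]]
alt_specific_names = OrderedDict()
alt_specific_names["pf"] = ["pf_alts_1_2", "pf_alts_3_4"]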
Compared to estimating a Multinomial Logit model, creating a Mixed Logit model requires the declaration of additional arguments.
In particular, we have to specify the identification column over which the coefficients will be randomly distributed. Usually, this is the "observation id" for our dataset. While it is unfortunately named (for now), the "obs_id_col" argument should be used to specify the id of each choice situation. The "mixing_id_col" should be used to specify the id of each unit of observation over which the coefficients are believed to be randomly distributed.
At the moment, PyLogit does not support estimation of Mixed Logit models where some coefficients are mixed over one unit of observation (such as individuals) and other coefficients are mixed over a different unit of observation (such as households).
Beyond specifying the mixing_id_col, one must also specify the variables whose coefficients will be treated as random. This is done through the "mixing_vars" argument, which takes a list containing the names of the desired coefficients. Note that the strings in the passed list must each be present in one of the lists within "names.values()".
When estimating a Mixed Logit model, we must specify the number of draws to be taken from the distributions of the random coefficients. This is done through the "num_draws" argument of the "fit_mle()" function. Additionally, we can specify a random seed so that the results of our estimation are reproducible. This is done through the "seed" argument of the "fit_mle()" function. Finally, the "init_vals" argument should contain enough initial values for both the original index coefficients and the standard deviations of the coefficients that are being treated as random variables.
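For instance, a model that treated only a subset of the coefficients as random would need one initial value per index coefficient plus one per mixed coefficient. A hypothetical sketch (the names below are illustrative only; the full model estimated in the next cell mixes all six variables):
# Hypothetical sketch: if only "pf" and "cl" were treated as
# random, we would need one initial value per index coefficient
# plus one standard deviation per mixed coefficient.
subset_mixing_vars = ["pf", "cl"]
num_index_coefs = sum(len(names) for names in example_names.values())
subset_init_vals = np.zeros(num_index_coefs + len(subset_mixing_vars))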
In [5]:
# Provide the module with the needed input arguments to create
# an instance of the Mixed Logit model class.
# Note that "chid" is used as the obs_id_col because "chid" is
# the choice situation id.
# Currently, the obs_id_col argument name is unfortunate because
# in the most general of senses, it refers to the situation id.
# In panel data settings, the mixing_id_col argument is what one
# would generally think of as an "observation id".
# For mixed logit models, the "mixing_id_col" argument specifies
# the units of observation that the coefficients are randomly
# distributed over.
example_mixed = pl.create_choice_model(data=long_electricity,
alt_id_col="alt",
obs_id_col="chid",
choice_col="choice",
specification=example_specification,
model_type="Mixed Logit",
names=example_names,
mixing_id_col="id",
mixing_vars=index_var_names)
# Note 2 * len(index_var_names) is used because we are estimating
# both the mean and standard deviation of each of the random coefficients
# for the listed index variables.
example_mixed.fit_mle(init_vals=np.zeros(2 * len(index_var_names)),
num_draws=600,
seed=123)
# Look at the estimated results
example_mixed.get_statsmodels_summary()
Out[5]:
The mlogit call and output are given below:
> set.seed(123)
> electricity_mixl = mlogit(choice~pf+cl+loc+wk+tod+seas|0,
+                           electricity_long,
+                           rpar=c(pf="n",
+                                  cl="n",
+                                  loc="n",
+                                  wk="n",
+                                  tod="n",
+                                  seas="n"),
+                           R=600,
+                           halton=NULL,
+                           print.level=0,
+                           panel=TRUE)
> summary(electricity_mixl)

Call:
mlogit(formula = choice ~ pf + cl + loc + wk + tod + seas | 0,
    data = electricity_long, rpar = c(pf = "n", cl = "n", loc = "n",
    wk = "n", tod = "n", seas = "n"), R = 600, halton = NULL,
    panel = TRUE, print.level = 0)

Frequencies of alternatives:
      1       2       3       4
0.22702 0.26393 0.23816 0.27089

bfgs method
23 iterations, 0h:1m:48s
g'(-H)^-1g = 1.66E-07
gradient close to zero

Coefficients :
         Estimate Std. Error t-value  Pr(>|t|)
pf      -0.980049   0.035852 -27.336 < 2.2e-16 ***
cl      -0.214234   0.014205 -15.082 < 2.2e-16 ***
loc      2.339543   0.088591  26.408 < 2.2e-16 ***
wk       1.596299   0.069847  22.854 < 2.2e-16 ***
tod     -9.325374   0.304802 -30.595 < 2.2e-16 ***
seas    -9.643796   0.310437 -31.065 < 2.2e-16 ***
sd.pf    0.207022   0.011939  17.340 < 2.2e-16 ***
sd.cl    0.393249   0.019352  20.321 < 2.2e-16 ***
sd.loc   1.770551   0.099411  17.811 < 2.2e-16 ***
sd.wk    1.183782   0.083063  14.252 < 2.2e-16 ***
sd.tod   2.282927   0.128163  17.813 < 2.2e-16 ***
sd.seas  1.578440   0.124485  12.680 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -3904.7

random coefficients
     Min.     1st Qu.     Median       Mean     3rd Qu. Max.
pf   -Inf  -1.1196828 -0.9800489 -0.9800489 -0.84041499  Inf
cl   -Inf  -0.4794766 -0.2142339 -0.2142339  0.05100865  Inf
loc  -Inf   1.1453244  2.3395430  2.3395430  3.53376158  Inf
wk   -Inf   0.7978500  1.5962985  1.5962985  2.39474707  Inf
tod  -Inf -10.8651849 -9.3253739 -9.3253739 -7.78556285  Inf
seas -Inf -10.7084372 -9.6437959 -9.6437959 -8.57915462  Inf
Clearly, there are differences between the estimation results of mlogit and those of PyLogit. However, these differences are not large, and it is not entirely clear where the differences between the two estimation routines come from.
The results are close enough to suspect that the estimation differences stem from simulation randomness and the fact that one cannot force the randomly generated draws to be the same across programming languages (R versus Python).
The log-likelihood values are quite close to each other, and none of the parameter estimates appear to be meaningfully different from each other (in the sense that each estimate falls within the other's 95% confidence interval).
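As a rough check of that claim, the sketch below compares the PyLogit point estimates against 95% confidence intervals built from the mlogit output above. It assumes the fitted PyLogit model stores its estimates, means first and standard deviations last, in a pandas Series attribute named "params"; that attribute name and ordering are assumptions about the PyLogit API rather than something demonstrated in this notebook.
# Minimal sketch of the confidence-interval comparison described
# above. The mlogit values are copied from the summary output.
mlogit_estimates = np.array([-0.980049, -0.214234, 2.339543,
                             1.596299, -9.325374, -9.643796,
                             0.207022, 0.393249, 1.770551,
                             1.183782, 2.282927, 1.578440])
mlogit_std_errs = np.array([0.035852, 0.014205, 0.088591,
                            0.069847, 0.304802, 0.310437,
                            0.011939, 0.019352, 0.099411,
                            0.083063, 0.128163, 0.124485])
# 95% confidence intervals for the mlogit estimates.
lower_bounds = mlogit_estimates - 1.96 * mlogit_std_errs
upper_bounds = mlogit_estimates + 1.96 * mlogit_std_errs
# The sign of an estimated standard deviation is not identified,
# so compare the standard deviation terms in absolute value.
# NOTE: example_mixed.params is an assumed accessor.
n_means = len(index_var_names)
pylogit_estimates = example_mixed.params.values.copy()
pylogit_estimates[n_means:] = np.abs(pylogit_estimates[n_means:])
print((lower_bounds <= pylogit_estimates)
      & (pylogit_estimates <= upper_bounds))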